OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis

Internal identifier: 000057 (France/Analysis); previous: 000056; next: 000058

Authors: Jean-Yves Ramel [France]; Nicolas Sidere [France]; Frédéric Rayar [France]

Source: Literary and linguistic computing (ISSN 0268-1145), 2013

RBID : Francis:14-0182616

French descriptors

English descriptors

Abstract

This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work lies in the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then grouped into clusters of similar patterns so that the different letterforms of a book can be extracted and analysed to compute redundancy rates. This process significantly reduces the number of letterforms to be recognized. Once the letterforms have been clustered, a user may assign a label to each cluster using the second package, RETRO. Labels are then automatically assigned to each corresponding character to transcribe the text of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical font recognition methods to improve the recognition results of OCR systems on rare or specific books.
The identification of typographic materials could also be useful for the study of both the aesthetic aspects (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and the economic aspects of historical printing. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
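
The labelling arithmetic in the abstract can be made concrete with a small sketch. The code below is not AGORA or RETRO. It assumes that a clustering step has already assigned a cluster id to every extracted letterform, and it takes the redundancy rate to be one minus the ratio of clusters to letterforms, which is only one possible reading of the abstract. The names `redundancy_rate` and `transcribe`, and the toy data, are hypothetical.

```python
def redundancy_rate(cluster_ids):
    """Share of letterforms that fall into an already-represented cluster.

    Assumed definition: 1 - (number of clusters / number of letterforms).
    """
    if not cluster_ids:
        raise ValueError("no letterforms given")
    return 1 - len(set(cluster_ids)) / len(cluster_ids)


def transcribe(cluster_ids, cluster_labels, unknown="?"):
    """Propagate one user-assigned label per cluster to every letterform."""
    return "".join(cluster_labels.get(cid, unknown) for cid in cluster_ids)


# Toy page: 20 letterforms in reading order, grouped into 2 clusters.
page = [0, 1] * 10
# The user labels each cluster once (hypothetical labels).
labels = {0: "a", 1: "b"}

rate = redundancy_rate(page)     # 1 - 2/20, i.e. 90% redundant
to_label = len(set(page))        # 2 labels cover all 20 letterforms
text = transcribe(page, labels)  # every letterform receives its cluster label
```

With these toy values the user labels 2 letterforms out of 20, which is the "one character out of ten" case described above. Letterforms whose cluster has no label yet are rendered as `?`, so unfinished labelling stays visible in the output.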

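Url: https://hal.archives-ouvertes.fr/hal-01022631

Affiliations: Laboratoire d'Informatique de Tours (EA 6300), Ecole d'ingénieurs Polytechnique de l'Université de Tours, Tours, France
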
The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">14-0182616</idno>
<date when="2013">2013</date>
<idno type="stanalyst">FRANCIS 14-0182616 INIST</idno>
<idno type="RBID">Francis:14-0182616</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000035</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000752</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000037</idno>
<idno type="wicri:doubleKey">0268-1145:2013:Ramel J:interactive:layout:analysis</idno>
<idno type="wicri:Area/Main/Merge">000197</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01022631</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01022631</idno>
<idno type="wicri:Area/Hal/Corpus">000070</idno>
<idno type="wicri:Area/Hal/Curation">000070</idno>
<idno type="wicri:Area/Hal/Checkpoint">000050</idno>
<idno type="wicri:doubleKey">0268-1145:2013:Ramel J:interactive:layout:analysis</idno>
<idno type="wicri:Area/Main/Merge">000155</idno>
<idno type="wicri:Area/Main/Curation">000194</idno>
<idno type="wicri:Area/Main/Exploration">000194</idno>
<idno type="wicri:Area/France/Extraction">000057</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic recognition</term>
<term>Computational linguistics</term>
<term>Electronic library</term>
<term>Electronic storage</term>
<term>Extraction</term>
<term>Graphics</term>
<term>Optical character recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Linguistique informatique</term>
<term>Extraction</term>
<term>Texte</term>
<term>Représentation graphique</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance automatique</term>
<term>Bibliothèque électronique</term>
<term>Archivage électronique</term>
<term>Humanités numériques</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Centre-Val de Loire</li>
<li>Région Centre</li>
</region>
<settlement>
<li>Tours</li>
</settlement>
<orgName>
<li>Centre Val de Loire Université</li>
<li>Université François-Rabelais de Tours</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Région Centre">
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
</region>
<name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
</country>
</tree>
</affiliations>
</record>
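
As a minimal sketch, the record above can be read with Python's standard `xml.etree.ElementTree`. One practical point: the record uses the `wicri:` and `inist:` prefixes without declaring them, so a namespace-aware parser rejects it as written. The example binds both prefixes to placeholder URIs before parsing. Those URIs, the helper name `parse_record`, and the trimmed excerpt are assumptions made for illustration, not part of Wicri or Dilib.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record above. The wicri: prefix appears without
# a namespace declaration, as in the generated page.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author>
<name sortKey="Ramel, Jean Yves" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="4">
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" first="Nicolas" last="Sidere">Nicolas Sidere</name>
</author>
<author>
<name sortKey="Rayar, Frederic" first="Frédéric" last="Rayar">Frédéric Rayar</name>
</author>
</titleStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Extraction</term>
<term>Optical character recognition</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URIs (assumptions, not the real Wicri/INIST ones).
# They only bind the prefixes so that the parser accepts the text.
DECLS = 'xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist"'


def parse_record(text):
    """Bind the undeclared prefixes on the root element, then parse."""
    return ET.fromstring(text.replace("<record>", f"<record {DECLS}>", 1))


root = parse_record(RECORD)
title = root.findtext(".//titleStmt/title")
authors = [n.text for n in root.findall(".//titleStmt/author/name")]
keywords = [t.text for t in root.findall(".//keywords[@scheme='KwdEn']/term")]
```

Restricting the author query to `titleStmt` matters for the full record, where each author also appears under `sourceDesc` and in the `affiliations` tree.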

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000057 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000057 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    France
   |étape=   Analysis
   |type=    RBID
   |clé=     Francis:14-0182616
   |texte=   Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis
}}
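
When links to many records are needed, the template call above can be generated rather than typed by hand. The following is a small sketch that reproduces the layout shown, using only the parameter names visible in the template. The function name `explor_link` and its argument names are hypothetical, and the sketch does no escaping of wiki markup in the values.

```python
def explor_link(wiki, area, flux, etape, key_type, key, text):
    """Render an {{Explor lien}} call with the parameter names shown above."""
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", key_type),
        ("clé", key),
        ("texte", text),
    ]
    lines = ["{{Explor lien"]
    # Pad "name=" to a fixed width so the values line up as in the page.
    lines += [f"   |{name + '=':<9}{value}" for name, value in fields]
    lines.append("}}")
    return "\n".join(lines)


link = explor_link(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="France",
    etape="Analysis",
    key_type="RBID",
    key="Francis:14-0182616",
    text="Interactive layout analysis, content extraction, and transcription "
         "of historical printed books using Pattern Redundancy Analysis",
)
```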

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024